Source: Michael Haslam
Source: Michael Haslam

Introduction

loading in the data:

library(curl)
## Using libcurl 8.10.1 with Schannel
f <- curl("https://raw.githubusercontent.com/clairezng/clairezh-AN588-Replication/main/DATASET_CARDOSO&OTTONI.csv")
d <- read.csv(f, header = TRUE, stringsAsFactors = FALSE)
head(d) # looks fine
##    ID POPULATION  SEX GROUP.AGE NUMVISIT TIME.VISIT Mean_TIMEVISIT FINGERS
## 1 TOR       SCNP MALE     ADULT       53       7715         145.57       0
## 2 BEI       SCNP MALE     ADULT       35       9974         284.97       1
## 3 ZAN       SCNP MALE     ADULT       63      10663         168.78       5
## 4 NIC       SCNP MALE     ADULT       42      11793         280.79       0
## 5 ZEN       SCNP MALE     ADULT       38       6513         171.39       2
## 6 CLA       SCNP MALE     ADULT       27       5570         206.30       0
##   PROBE.EVENT SUCCESSFUL NUMBER.OF.PROBE.TOOLS
## 1         489        457                    70
## 2         656        616                    52
## 3         655        632                    91
## 4         797        735                    89
## 5         548        518                    36
## 6         342        330                    29

descriptive statistics:

authors stated that the SCNP capuchins visited the boxes at a mean = 213 visits/day; SD = 55; total individual visits = 1067. the FBV capuchins visited the boxes at a mean = 29 visits/day; SD = 12; total individual visits = 376

partitioning the two populations into separate datasets

SCNP <- subset(d, POPULATION == "SCNP")
FBV <- subset(d, POPULATION == "FBV")
SCNP
##     ID POPULATION    SEX GROUP.AGE NUMVISIT TIME.VISIT Mean_TIMEVISIT FINGERS
## 1  TOR       SCNP   MALE     ADULT       53       7715         145.57       0
## 2  BEI       SCNP   MALE     ADULT       35       9974         284.97       1
## 3  ZAN       SCNP   MALE     ADULT       63      10663         168.78       5
## 4  NIC       SCNP   MALE     ADULT       42      11793         280.79       0
## 5  ZEN       SCNP   MALE     ADULT       38       6513         171.39       2
## 6  CLA       SCNP   MALE     ADULT       27       5570         206.30       0
## 7  BLP       SCNP   MALE       SUB       34       9010         265.00       0
## 8  CAP       SCNP   MALE  JUVENILE       47       9504         202.21       2
## 9  LIM       SCNP   MALE  JUVENILE       27       6219         230.33      11
## 10 COR       SCNP   MALE  JUVENILE       53       9659         182.25       7
## 11 VOL       SCNP   MALE  JUVENILE       64       9201         143.77      10
## 12 PAD       SCNP   MALE  JUVENILE       59       8175         138.56      10
## 13 DES       SCNP   MALE  JUVENILE       66       7281         110.32      13
## 14 CIN       SCNP   MALE  JUVENILE       77       7043          91.47       0
## 15 GOR       SCNP FEMALE     ADULT       56       7040         125.71     376
## 16 MAC       SCNP FEMALE     ADULT       69       8730         126.52     268
## 17 BEM       SCNP FEMALE     ADULT       58       7850         135.34     598
## 18 CAN       SCNP FEMALE     ADULT       27       4804         177.93     430
## 19 NIN       SCNP FEMALE     ADULT       17        977          57.47       7
## 20 LIC       SCNP FEMALE     ADULT       30       3514         117.13      36
## 21 ALI       SCNP FEMALE     ADULT       42       5131         122.17     318
## 22 VES       SCNP FEMALE     ADULT       16       2373         148.31      93
## 23 BAT       SCNP FEMALE  JUVENILE       67       7944         118.57     292
##    PROBE.EVENT SUCCESSFUL NUMBER.OF.PROBE.TOOLS
## 1          489        457                    70
## 2          656        616                    52
## 3          655        632                    91
## 4          797        735                    89
## 5          548        518                    36
## 6          342        330                    29
## 7          635        601                    31
## 8          784        766                    78
## 9          278        258                    42
## 10         805        726                   103
## 11         262        156                    25
## 12         150         37                    44
## 13          22          4                    12
## 14           0          0                     0
## 15           0         NA                    NA
## 16           0         NA                    NA
## 17           0         NA                    NA
## 18           0         NA                    NA
## 19           0         NA                    NA
## 20           0         NA                    NA
## 21           0         NA                    NA
## 22           0         NA                    NA
## 23           0         NA                    NA
FBV # looks okay!
##     ID POPULATION    SEX GROUP.AGE NUMVISIT TIME.VISIT Mean_TIMEVISIT FINGERS
## 24 TEI        FBV   MALE     ADULT       20       1162          58.10      13
## 25 MSN        FBV   MALE     ADULT       10        625          62.50       0
## 26 JTB        FBV   MALE     ADULT       38       1647          43.34       5
## 27 CAT        FBV   MALE       SUB       48       3169          66.02       5
## 28 COC        FBV   MALE  JUVENILE       38       3475          91.45      74
## 29 CNG        FBV   MALE  JUVENILE       11        757          68.82       7
## 30 PAT        FBV   MALE  JUVENILE       24       2507         104.46      32
## 31 DIT        FBV FEMALE     ADULT       59       3012          51.05       2
## 32 CHU        FBV FEMALE     ADULT       50       4336          86.72       6
## 33 PSS        FBV FEMALE     ADULT       18       2601         129.17      10
## 34 AMR        FBV FEMALE     ADULT        7        827         118.14       0
## 35 DOR        FBV FEMALE  JUVENILE        7        215          30.71       0
## 36 PAM        FBV FEMALE  JUVENILE       20       2325         130.05      13
## 37 PAS        FBV FEMALE  JUVENILE       18        948          52.67       0
##    PROBE.EVENT SUCCESSFUL NUMBER.OF.PROBE.TOOLS
## 24           0         NA                    NA
## 25           0         NA                    NA
## 26           0         NA                    NA
## 27           0         NA                    NA
## 28           0         NA                    NA
## 29           0         NA                    NA
## 30           0         NA                    NA
## 31           0         NA                    NA
## 32           0         NA                    NA
## 33           0         NA                    NA
## 34           0         NA                    NA
## 35           0         NA                    NA
## 36           0         NA                    NA
## 37           0         NA                    NA

calculating mean and SD number of visits for the SCNP population:

sum(SCNP$NUMVISIT)
## [1] 1067
mean(SCNP$NUMVISIT)
## [1] 46.3913
sd(SCNP$NUMVISIT)
## [1] 17.70141

the total number of visits adds up fine. maybe they made a typo? I don’t understand how the dataset I have could possibly turn out a mean of 213 visits/days of presentation.

looking at the FBV population:

sum(FBV$NUMVISIT)
## [1] 368
mean(FBV$NUMVISIT)
## [1] 26.28571
sd(FBV$NUMVISIT)
## [1] 17.19315

okay. well the total doesn’t even add up correctly, much less the mean and standard deviation hello? what am I doing wrong, because there’s no way THEIR data is just crazy incorrect.

trying the “Mann-Whitney test” they do. They say the Z = -4541, p < 0.0001. I’m honestly not even sure what they’re calculating the statistics of, but my best guess is that it’s comparing the frequency of probe use in the SCNP group vs. the FBV population. also, I’m assuming that the Mann-Whitney test is an unpaired two-sample Wilcoxon test

mw.probes <- wilcox.test(SCNP$PROBE.EVENT, FBV$PROBE.EVENT, paired = FALSE)
## Warning in wilcox.test.default(SCNP$PROBE.EVENT, FBV$PROBE.EVENT, paired =
## FALSE): cannot compute exact p-value with ties
mw.probes
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  SCNP$PROBE.EVENT and FBV$PROBE.EVENT
## W = 252, p-value = 0.0008902
## alternative hypothesis: true location shift is not equal to 0

this doesn’t give me enough information, except that the p-value isn’t correct, so maybe they ran the Mann-Whitney test on something else?

in the table, they list values for length of the direct engagement with the task and the mean time of visits, so I’ll try it for that too:

mw.time <- wilcox.test(SCNP$TIME.VISIT, FBV$TIME.VISIT, paired = FALSE)
mw.time # p-value = 2.987e-07, which could definitely be the p < 0.0001 value they're talking about
## 
##  Wilcoxon rank sum exact test
## 
## data:  SCNP$TIME.VISIT and FBV$TIME.VISIT
## W = 306, p-value = 2.987e-07
## alternative hypothesis: true location shift is not equal to 0
mw.meantime <- wilcox.test(SCNP$Mean_TIMEVISIT, FBV$Mean_TIMEVISIT, paired = FALSE)
mw.meantime # okay, this spits out a p-value = 5.623e-06, which could also be the p < 0.0001 value they list
## 
##  Wilcoxon rank sum exact test
## 
## data:  SCNP$Mean_TIMEVISIT and FBV$Mean_TIMEVISIT
## W = 294, p-value = 5.623e-06
## alternative hypothesis: true location shift is not equal to 0

these definitely seem more promising. now I need to figure out the Z-score of -4541, which seems absurd? I wonder if the “Z” is supposed to represent something else. Also, in order to assume a normal distribution, both \(n_{1}\) (the SCNP group) and \(n_{2}\) (the FBV group) would have to be more than 20, which they aren’t.

Nevertheless, we have to convert the W values the Wilcoxon Rank Sum test spits out into Mann-Whitney U values.

R tells me that R’s value can also be computed as the number of all pairs (x[i], y[j]) for which y[j] is not greater than x[i], the most common definition of the Mann-Whitney test. To ME, this seems like the W value is also the value of U.

Running into another problem here: 306 and 294 seem like the larger of the two U values, so I’ll reorder my variables

mw.time2 <- wilcox.test(FBV$TIME.VISIT, SCNP$TIME.VISIT, paired = FALSE)
mw.time2
## 
##  Wilcoxon rank sum exact test
## 
## data:  FBV$TIME.VISIT and SCNP$TIME.VISIT
## W = 16, p-value = 2.987e-07
## alternative hypothesis: true location shift is not equal to 0
mw.meantime2 <- wilcox.test(FBV$Mean_TIMEVISIT, SCNP$Mean_TIMEVISIT, paired = FALSE)
mw.meantime2
## 
##  Wilcoxon rank sum exact test
## 
## data:  FBV$Mean_TIMEVISIT and SCNP$Mean_TIMEVISIT
## W = 28, p-value = 5.623e-06
## alternative hypothesis: true location shift is not equal to 0

there we go! U = 16 when comparing total direct interaction time, and U = 28 when comparing mean time per visit.

in order to get a Z-value, we must assume U is approximately normally distributed. so let’s just say it is, even though the sample sizes fall a little bit short.

\[z = \frac{U - \mu_{U}}{\sigma_{U}}\]

where the mean of U \(\mu_{U} = \frac{n_{1}n_{2}}{2}\) and the standard deviation of U \(\sigma_{U} = \sqrt{\frac{n_{1}n_{2}(n_{1}+n_{2}+1)}{12}}\)

I’m going to write a function to calculate the Z-score at least, because I need to run two different values (and may have to deal with more later:

  • note that I tried to use the qnorm() function to find the Z-score. just trust me that it did not work.

  • i’m also going to use a continuity correction of 0.5, because the M-W test is categorical and my sample size is small. lol

mw.z <- function(U, n1, n2, correction = TRUE) {
  mean_U <- (n1*n2) / 2
  sd_U <- sqrt((n1*n2*(n1+n2+1))/12)
  correction <- if (correction) 0.5 else 0
  z <- (U - mean_U - correction) / sd_U
  print(z)
}
mw.z(16, length(FBV$TIME.VISIT), length(SCNP$TIME.VISIT))
## [1] -4.556526
mw.z(28, length(FBV$Mean_TIMEVISIT), length(SCNP$Mean_TIMEVISIT))
## [1] -4.18073

these are terribly close (minus the decimal point). maybe i’ll take out the continuity correction:

mw.z(16, length(FBV$TIME.VISIT), length(SCNP$TIME.VISIT), correction = FALSE)
## [1] -4.540868
mw.z(28, length(FBV$Mean_TIMEVISIT), length(SCNP$Mean_TIMEVISIT), correction = FALSE)
## [1] -4.165072

OH MY GOD the Z = -4.540868 is what we’re looking for because if you round it to the thousandth it’s -4.541 which is the number they state. except (obviously) their value is missing a decimal point but maybe that’s just a typo. THANK GOD. okay so they’re looking at the difference between the length of direct interaction(s) between the two groups. holyyyy

moving on to recreating figure one and figure two. they are pretty simple bar graphs.

  • as a note, they are only looking males in the SCNP group, so I’ll be pruning the data accordingly.
  • they also removed CIN, presumably because he had 0 probe events despite being in the SCNP group
library(ggplot2)
names(SCNP)
##  [1] "ID"                    "POPULATION"            "SEX"                  
##  [4] "GROUP.AGE"             "NUMVISIT"              "TIME.VISIT"           
##  [7] "Mean_TIMEVISIT"        "FINGERS"               "PROBE.EVENT"          
## [10] "SUCCESSFUL"            "NUMBER.OF.PROBE.TOOLS"
SCNP_males <- subset(SCNP, SEX == "MALE")
SCNP_males <- SCNP_males[SCNP_males$ID !="CIN", ]
SCNP_males # checking this
##     ID POPULATION  SEX GROUP.AGE NUMVISIT TIME.VISIT Mean_TIMEVISIT FINGERS
## 1  TOR       SCNP MALE     ADULT       53       7715         145.57       0
## 2  BEI       SCNP MALE     ADULT       35       9974         284.97       1
## 3  ZAN       SCNP MALE     ADULT       63      10663         168.78       5
## 4  NIC       SCNP MALE     ADULT       42      11793         280.79       0
## 5  ZEN       SCNP MALE     ADULT       38       6513         171.39       2
## 6  CLA       SCNP MALE     ADULT       27       5570         206.30       0
## 7  BLP       SCNP MALE       SUB       34       9010         265.00       0
## 8  CAP       SCNP MALE  JUVENILE       47       9504         202.21       2
## 9  LIM       SCNP MALE  JUVENILE       27       6219         230.33      11
## 10 COR       SCNP MALE  JUVENILE       53       9659         182.25       7
## 11 VOL       SCNP MALE  JUVENILE       64       9201         143.77      10
## 12 PAD       SCNP MALE  JUVENILE       59       8175         138.56      10
## 13 DES       SCNP MALE  JUVENILE       66       7281         110.32      13
##    PROBE.EVENT SUCCESSFUL NUMBER.OF.PROBE.TOOLS
## 1          489        457                    70
## 2          656        616                    52
## 3          655        632                    91
## 4          797        735                    89
## 5          548        518                    36
## 6          342        330                    29
## 7          635        601                    31
## 8          784        766                    78
## 9          278        258                    42
## 10         805        726                   103
## 11         262        156                    25
## 12         150         37                    44
## 13          22          4                    12

I’m also going to reorder individuals because I want to match the order they’re listed in within the paper.

SCNP_males$ID <- factor(SCNP_males$ID, levels = c("TOR", "BEI", "ZAN", "NIC", "ZEN", "CLA", "BLP", "CAP", "LIM", "COR", "VOL", "PAD", "DES"))
# plotting!
ggplot(data=SCNP_males, aes(x=ID, y=NUMBER.OF.PROBE.TOOLS)) +
  geom_bar(stat="identity", fill="red", width = 0.5) + 
  labs(x = element_blank(), y = "no. probe tools", caption = "Figure 1. Number of sticks used by each monkey of the SCNP group.") +
  theme(plot.caption = element_text(hjust = 0)) +
  scale_y_continuous(breaks = seq(0, 100, by = 10), expand=expansion(mult=c(0,0.01))) 

and here is the original Figure 1 from the paper:

Figure 1. Number of sticks used by each monkey of the SCNP group. All males; females did not use probes.
Figure 1. Number of sticks used by each monkey of the SCNP group. All males; females did not use probes.

this makes the fact that the figures are inaccurate MUCH more noticeable - the researchers’ graphs definitely aren’t to scale, and it seems to outright plot some values incorrectly?

Figure 2 replication:

# first, calculating proportions of successful probing:
SCNP_males$PROBE.SUCCESS <- SCNP_males$SUCCESSFUL/SCNP_males$PROBE.EVENT
SCNP_males$PROBE.SUCCESS # looks fine
##  [1] 0.9345603 0.9390244 0.9648855 0.9222083 0.9452555 0.9649123 0.9464567
##  [8] 0.9770408 0.9280576 0.9018634 0.5954198 0.2466667 0.1818182
# plotting:
ggplot(data=SCNP_males, aes(x=ID, y=PROBE.SUCCESS)) +
  geom_bar(stat="identity", fill="navy", width = 0.5) + 
  labs(x = element_blank(), y = "proportion of successful probing", caption = "Figure 2. Proportions of successful probing by each monkey of the SCNP group.") + 
  theme(plot.caption = element_text(hjust = 0)) +
  scale_y_continuous(breaks = seq(0, 1.0, by = 0.2), expand = expansion(mult = c(0,0.05))) 

and here is the original Figure 2:

Figure 2. Proportions of successful probing by each monkey of the SCNP group. All males; females did not use probes.
Figure 2. Proportions of successful probing by each monkey of the SCNP group. All males; females did not use probes.

My y-axis intervals aren’t entirely accurate, but I can’t comprehend why the original figure lists 0, 0.3, 0.5, 0.8, and 1.